4 research outputs found

Parallel Sentence Mining by Constrained Decoding

Author: Bogoychev Nikolay
Chen Patrick
Heafield Kenneth
Kirefu Faheem
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2020

Crossref

Edinburgh Research Explorer

The University of Edinburgh’s Submissions to the WMT19 News Translation Task

Author: Bawden Rachel
Birch-Mayne Alexandra
Bogoychev Nikolay
Germann Ulrich
Grundkiewicz Roman
Kirefu Faheem
Miceli Barone Antonio
Publication date: 12/07/2019

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German-to-English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English-to-Czech, we compared different pre-processing and tokenisation regimes.

Comment: To appear in the Proceedings of WMT19: Shared Task Paper

arXiv.org e-Print Archive

ZENODO

Edinburgh Research Explorer

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Author: Bañón Marta
Chen Pinzhen
Esplà-Gomis Miquel
Forcada Mikel
Haddow Barry
Heafield Kenneth
Hoang Hieu
Kamran Amir
Kirefu Faheem
Koehn Philipp
Ortiz-Rojas Sergio
Pla Leopoldo
Ramírez-Sánchez Gema
Sarrías Elsa
Strelec Marek
Thompson Brian
Waites William
Wiggins Dion
Zaragoza Jaume
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2020

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

Crossref

University of Strathclyde Institutional Repository

Edinburgh Research Explorer